Board Game Geek is the most popular website aggregating all board games ever published, as well as all the addons. For many games when you do a Google Search, the BGG comes in TOP 5 results, frequently as the first spot. This means that the when someone tries to find out if a particular game is good, they frequently land on BGG. However, unlike other mediums (books, movies etc.) the reviews on BGG do not represent the generic population. Because the website is targeted to enthusiasts, a significant portion of grades are from only a (geeky) part of the gaming community.
Since reviews on BGG are mostly prepared by enthusiasts, but are used to drive buying decisions of the general population, we want to see what it takes to achieve a favorable position in the general ranking. Are some genres more favored (e.g. strategy/war games)? Does complexity matter? Are controversial games more, or less popular?
We start with the standard libraries (tidyverse, ggplot2) and add some interesting extensions.
library(tidyverse)
library(ggplot2)
library(ggthemes)
library(ggrepel)
library(cowplot)
library(gganimate)
library(ggridges)
library(viridis)
library(ggtech) # For the airbnb theme
library(extrafont) # To add the font for Airbnb theme
library(GGally) # Grapihical correlogram
library(knitr) # Nice tables-kables
library(kableExtra)
# Download this fonts and install manually
# https://github.com/ricardo-bion/ggtech/blob/master/Circular%20Air-Medium%203.46.45%20PM.ttf
# https://github.com/ricardo-bion/ggtech/blob/master/Circular%20Air-Bold%203.46.45%20PM.ttf
# Import all fonts, takes a few minutes but removes the annoying warnings
# font_import()
# font_import(pattern = 'Circular', prompt=FALSE)
# loadfonts(device = "win")
# Lets set the theme to this nice one from Airbnb
# Colors: c("#FF5A5F", "#FFB400", "#007A87", "#FFAA91", "#7B0051")
theme_set(theme_airbnb_fancy())
pink = "#FF5A5F"
orange = "#FFB400"
blueGreen = "#007A87"
flesh = "#FFAA91"
purple = "#7B0051"
options(scipen=999) #avoiding e10 notation
Some facts about the dataset: - There are 17 000 board games scrapped from Board Game Geek. - All games have > 30 reviews. - 13 000 games are not included in the dataset, because they had less than 30 reviews. - the second dataset consists of 13 000 000 reviews, each with a number and optionally a comment. - Source: https://www.kaggle.com/jvanelteren/boardgamegeek-reviews - Collected in 2019.
reviews <- 'data/bgg-13m-reviews.csv' %>%
read_csv()
games <- 'data/games_detailed_info.csv' %>%
read_csv()
We need to drop some columns and rename others to make life easier with ggplot.
# Drop the index
reviews <- reviews %>%
select(-X1)
# Rename and drop some variables
games <- games %>%
select(
"Abstract_Rank" = "Abstract Game Rank",
"Rank" = "Board Game Rank",
"Childrens_Rank" = "Children's Game Rank",
"Family_Rank" = "Family Game Rank",
"Party_Rank" = "Party Game Rank",
"Strategy_Rank" = "Strategy Game Rank",
"Thematic_Rank" = "Thematic Rank",
"War_Game_Rank" = "War Game Rank",
"Ratings_average" = "average",
"Complexity" = "averageweight",
"Geekscore" = "bayesaverage",
"Category" = "boardgamecategory",
"Mechanics" = "boardgamemechanic",
'Publisher' = "boardgamepublisher",
'Description' = "description",
"id",
"image",
"maxplayers",
"maxplaytime",
"minage",
"minplayers",
"minplaytime",
"numcomments",
"numweights",
"owned",
"playingtime",
"name" = "primary",
"Ratings_std_dev" = "stddev",
"thumbnail",
"trading",
"usersrated",
"wanting",
"wishing",
"yearpublished"
)
There are some ‘joke’ games, e.g. a war game that takes 1 500 hours to complete (The Campaign for North Africa).
# Lets remove some outliers to make life easier
# Some outliers are ok, eg. for most popular games the "wishing" will be very high
games <- games %>%
filter(yearpublished > 1950,
maxplayers < 20,
maxplaytime < 600,
maxplaytime < 600,
playingtime < 600)
For each game, the participants can vote on the compelxity rating (from 1 to 5). These votes are later averaged and presented as a single score. We want to bin this variable, to make plotting easier. We dropped the 5th cattegory, becuase only a few games have a score of > 4.25.
games %>%
ggplot(aes(x = Complexity)) +
geom_histogram(bins = 60, color = orange, fill = orange, alpha = 0.5) +
labs(title = 'Distribution on average complexity', x = 'Complexity', y = 'Count')
# Values are from 1-5, but there is very little after 4.25
games$ComplexityBinned <- games$Complexity %>%
cut(
breaks = c(-Inf, 1.5, 2.5, 3.25, Inf),
labels = c(1, 2, 3, 4))
a <- games$ComplexityBinned %>%
table()
kable(a, col.names = c('Complexity rating', 'Number of games')) %>%
kable_styling(bootstrap_options = c("striped", "hover", "responsive"), full_width = F)
| Complexity rating | Number of games |
|---|---|
| 1 | 5487 |
| 2 | 6586 |
| 3 | 2992 |
| 4 | 1274 |
In addition to the main ranking, BGG proviedes additional ratings for some main categories. We want to see if there is some overlaping between this rankings.
# Create binary variables
# A game can have multiple main categories, so we can't create one column
games$isAbstract <- as.integer( !is.na(games$Abstract_Rank) )
games$isChildrens <- as.integer( !is.na(games$Childrens_Rank) )
games$isFamily <- as.integer( !is.na(games$Family_Rank) )
games$isParty <- as.integer( !is.na(games$Party_Rank) )
games$isStrategy <- as.integer( !is.na(games$Strategy_Rank) )
games$isThematic <- as.integer( !is.na(games$Thematic_Rank) )
games$isWarGame <- as.integer( !is.na(games$War_Game_Rank) )
# Some overlap between categories
a <- games %>%
group_by(isAbstract,
isChildrens,
isFamily,
isParty,
isStrategy,
isThematic,
isWarGame) %>%
summarise(n = n()) %>%
arrange(desc(n))
a[a == 0] <- ''
kable(a, col.names = c('Abstract',
'Children',
'Family',
'Party',
'Strategy',
'Thematic',
'War game',
'Number of games'),
align = 'c') %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F)
| Abstract | Children | Family | Party | Strategy | Thematic | War game | Number of games |
|---|---|---|---|---|---|---|---|
| 8094 | |||||||
| 1 | 2548 | ||||||
| 1 | 1247 | ||||||
| 1 | 1036 | ||||||
| 1 | 719 | ||||||
| 1 | 600 | ||||||
| 1 | 576 | ||||||
| 1 | 1 | 311 | |||||
| 1 | 302 | ||||||
| 1 | 1 | 209 | |||||
| 1 | 1 | 117 | |||||
| 1 | 1 | 108 | |||||
| 1 | 1 | 96 | |||||
| 1 | 1 | 88 | |||||
| 1 | 1 | 87 | |||||
| 1 | 1 | 70 | |||||
| 1 | 1 | 39 | |||||
| 1 | 1 | 33 | |||||
| 1 | 1 | 20 | |||||
| 1 | 1 | 10 | |||||
| 1 | 1 | 6 | |||||
| 1 | 1 | 5 | |||||
| 1 | 1 | 5 | |||||
| 1 | 1 | 1 | 4 | ||||
| 1 | 1 | 4 | |||||
| 1 | 1 | 1 | 1 | ||||
| 1 | 1 | 1 | |||||
| 1 | 1 | 1 | 1 | ||||
| 1 | 1 | 1 | |||||
| 1 | 1 | 1 | 1 |
There is - this creates a challange. How to choose a main category for each game?
For each game we create a quantile rank - 0.01 means that the game is in top 1% in the category.
# Convert category ranks to quantiles
games$quantileAbstract <- games %>%
.$Abstract_Rank / max(games$Abstract_Rank, na.rm = TRUE)
games$quantileChildrens <- games %>%
.$Childrens_Rank / max(games$Childrens_Rank, na.rm = TRUE)
games$quantileFamily <- games %>%
.$Family_Rank / max(games$Family_Rank, na.rm = TRUE)
games$quantileParty <- games %>%
.$Party_Rank / max(games$Party_Rank, na.rm = TRUE)
games$quantileStrategy <- games %>%
.$Strategy_Rank / max(games$Strategy_Rank, na.rm = TRUE)
games$quantileThematic <- games %>%
.$Thematic_Rank / max(games$Thematic_Rank, na.rm = TRUE)
games$quantileWarGame <- games %>%
.$War_Game_Rank / max(games$War_Game_Rank, na.rm = TRUE)
To find the main category of the game, we look at all the quantiles and choose the lowest one. Our logic is that it is much more significant if the game is 5th in ranking that has 100 games, than in one that has 1000.
# Find the primary category
# How? By the lowest quantile, eg. the most important in this catergory
find_lowest_quantile <- function(row, output) {
abstract = row["quantileAbstract"]
family = row["quantileChildrens"]
child = row["quantileFamily"]
party = row["quantileParty"]
strategy = row["quantileStrategy"]
thematic = row["quantileThematic"]
war = row["quantileWarGame"]
other = 1
lowest <- min(
abstract,
family,
child,
party,
strategy,
thematic,
war,
other,
na.rm = TRUE)
case_when(
lowest == abstract ~ 'Abstract',
lowest == family ~ 'Family',
lowest == child ~ "Children",
lowest == party ~ 'Party',
lowest == strategy ~ 'Strategy',
lowest == thematic ~ 'Thematic',
lowest == war ~ 'War Game',
lowest == 1 ~ 'Other'
) %>%
return()
}
# Takes a while
games$cat <- games %>%
apply(1, find_lowest_quantile) %>%
as.factor()
# A nice distribution, a lot of other
a <- games$cat %>% table()
kable(a, col.names = c('Category',
'Number of games')) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F)
| Category | Number of games |
|---|---|
| Abstract | 852 |
| Children | 1453 |
| Family | 718 |
| Other | 8098 |
| Party | 445 |
| Strategy | 1249 |
| Thematic | 791 |
| War Game | 2733 |
This plot will help us choose next steps. From the graphs we can see, that in some situations we’ll need to use log(x).
my_dens <- function(data, mapping, ...) {
ggplot(data = data, mapping = mapping) +
geom_density_ridges(aes(y = games$ComplexityBinned, fill = games$ComplexityBinned), scale = 3, rel_min_height = 0.01, alpha = 0.7, color = NA ) +
scale_fill_manual(values = c("#FF5A5F", "#FFB400", "#007A87", "#FFAA91", "#7B0051"))
}
my_dots <- function(data, mapping, ...) {
ggplot(data = data, mapping = mapping) +
geom_point(alpha = 0.1) +
scale_color_tech(theme="airbnb")
}
games %>%
select( "Rank",
"Ratings_average",
"yearpublished",
"usersrated",
"playingtime",
"usersrated") %>%
ggpairs(
mapping = aes(color = games$ComplexityBinned, fill = games$ComplexityBinned),
diag = list(continuous = my_dens),
lower = list(continuous = my_dots),
progress = TRUE,
axisLabels = 'none',
legend = 1) +
theme(text = element_text(size = 10)) +
labs(title = '')
Definetly.
Compared to 2000, the number of games published in 2019 nearly tripled. The market has never been so big and so competetive.
games %>%
ggplot(aes(x = yearpublished)) +
geom_density(color = orange, fill = orange, alpha = 0.5) +
labs(title = 'Board games published per year', x = 'Year', y = 'Density')
It would certanly seem so. The ratings are quite stable for older games, but they start to increase if we go 3-4 years back. Notice an interesting pattern - the standard deviation of newer games is bigger. This may be due to the fact than newer games have fewer reviews. There also is an interesting ‘preorder hype’ effect. This dataset is from 2019, and games from 2020 have significantly higher scores.
plot <- games %>%
filter(yearpublished > 2009) %>%
ggplot(aes(x = Ratings_average)) +
geom_density(color = orange, fill = orange, alpha = 0.5) +
labs(title = 'Ratings of games published in: {floor(frame_time)}', x = 'Ratings', y = 'Density') +
transition_time(yearpublished) +
ease_aes('linear')
plot %>% animate(width = 900,
height = 400,
renderer = gifski_renderer())
Yes. Simple party games have an average score of 6, while the most complex games have a score of 7.5. This is a hint of a pattern that we’ll analise further.
games %>%
ggplot(aes(x = Ratings_average, y = ComplexityBinned, fill = ComplexityBinned)) +
geom_density_ridges(scale = 3, rel_min_height = 0.01, alpha = 0.7, color = NA) +
scale_fill_manual(values = c("#FF5A5F", "#FFB400", "#007A87", "#FFAA91", "#7B0051")) +
scale_x_continuous(limits = c(3, 9.3)) +
labs(title = 'Does the complexity of the game infuence the ratings?',
x = 'Ratings',
y = 'Complexity of the game',
fill = 'Complexity')
Rank vs rating has an interesting distribution - there are some clear patterns. It seems that for a high rank, the average rating has to be quite high. However, just having high reviews does not guarantee a good position. There is a very corious contraction near the ~ 6.0 average. This is a simmilar score to the scores of very low complexity games.
games %>%
ggplot(aes(x = Ratings_average, y = Rank)) +
geom_point(aes(colour = Rank), alpha = 0.2) +
scale_color_gradient(low = orange, high = purple) +
labs(title = 'Does the average rating influence the final rank?',
subtitle = 'Chicken Drumstick distribution',
x = 'Average rating',
y = 'Overall rank') +
theme(legend.position = 'none') +
scale_y_reverse()
Log number of reviews is correlated with the rank, but there are some interesting points here: - The variance of rank decreases with the number of ratings. - There is a relatively small number of games that have a large number of reviews, but have poor position in the ranking. One theory is that these are the popular games like Monopoly, that do not have a good reputation in the more ‘geeky’ community
games %>%
ggplot(aes(x = log10(usersrated), y = Rank)) +
geom_point(aes(colour = Rank), alpha = 0.3) +
scale_color_gradient(low = orange, high = purple) +
labs(title = 'Does the number of user ratings influence the final rank?',
subtitle = 'Jaws distribution',
x = 'Log 10 of the number of ratings ',
y = 'Overall rank') +
theme(legend.position = 'none') +
scale_y_reverse()
For all games – yes. Strategy is highly favored, while family games are disfavored. Generally speaking the choice of category will influence the rank in a significant way.
games$isTop <- ifelse(games$Rank < 300, 1, 0)
games %>%
ggplot(aes(x = Rank, fill = cat)) +
geom_density(alpha = 0.7, color = NA) +
facet_wrap('cat') +
scale_fill_manual(values = c("#FF5A5F", "#FFB400", "#007A87", "#FFAA91", "#7B0051", "#FF5A5F", "#FFB400", "#007A87")) +
theme(legend.position = 'none') +
labs(title = 'Does the main category of the game infuence the rank?',
subtitle = 'All games',
x = 'Rank (lower is better)',
y = 'Density')
The top 1% is dominated by strategy games. Other categories have only a few positions, with the exception of thematic games. Suprisingly, there are some childer games too.
games %>%
filter(isTop == 1) %>%
ggplot(aes(x = Rank, fill = cat)) +
geom_histogram(alpha = 0.7, color = NA, bins = 30) +
facet_wrap('cat') +
scale_fill_manual(values = c("#FF5A5F", "#FFB400", "#007A87", "#FFAA91", "#7B0051", "#FF5A5F", "#FFB400", "#007A87")) +
theme(legend.position = 'none') +
labs(title = 'Does the main category of the game infuence the rank?',
subtitle = 'Top 300 games',
x = 'Rank (lower is better)',
y = 'Count')
By this we mean people adding games to their wishlists. There is a strong correlation between these two variables. However, we beleive the causality here may be reversed. A high position in the ranking will boosts the popularity and sales of a game.
games %>%
filter(wishing > 10) %>%
ggplot(aes(x = log10(wishing), y = Rank)) +
geom_point(aes(colour = Rank), alpha = 0.3) +
geom_smooth(color = purple) +
scale_color_gradient(low = orange, high = purple) +
labs(title = 'Does the number of user wishlisting influence the final rank?',
subtitle = 'Removed obeservations with < 10 users',
x = 'Log 10 of the number of users wishlisting',
y = 'Overall rank') +
theme(legend.position = 'none') +
scale_y_reverse()
Generally speaking, it doesn’t matter if the game has mostly textual reviews, or just numerical grades. There is some pattern in the very top part of the ranking. However, we beleive that this is due to the number of reviews.
games %>%
filter(numcomments / usersrated < 1) %>%
ggplot(aes(x = numcomments / usersrated , y = Rank)) +
geom_point(aes(colour = Rank), alpha = 0.5) +
scale_color_gradient(low = orange, high = purple) +
labs(title = '% of ratings with written reviews',
x = 'Number of comments / Number of ratings',
y = 'Overall rank') +
theme(legend.position = 'none') +
scale_y_reverse()
We interpret the standard deviation of the ratings to represent how controversial the game is. Higher deviation means more disagreement. Perhaps suprisingly, this has no significant effect on the ranking.
games %>%
ggplot(aes(x = Ratings_std_dev, y = Rank))+
geom_point(aes(colour = Rank), alpha = 0.5) +
scale_color_gradient(low = orange, high = purple) +
labs(title = 'How controversial the game is',
x = 'Standard deviation of ratings',
y = 'Overall rank') +
theme(legend.position = 'none') +
scale_y_reverse()
games %>%
filter(Ratings_std_dev > 0.75,
Ratings_std_dev < 2.5) %>%
ggplot(aes(x = Ratings_std_dev, y = Rank)) +
stat_density_2d(aes(fill = ..density..), geom = "raster", contour = FALSE) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
scale_fill_gradient(high = orange, low = purple) +
labs(title = 'How controversial the game is',
subtitle = 'Outliers removed',
x = 'Standard deviation of ratings',
y = 'Overall rank') +
scale_y_reverse()
Scale for 'y' is already present. Adding another scale for 'y', which will
replace the existing scale.